Separate GC WKS and SVR compilation units#126720
Conversation
|
Tagging subscribers to this area: @agocke, @dotnet/gc |
There was a problem hiding this comment.
Pull request overview
This PR refactors CoreCLR GC build plumbing to stop using wrapper translation units (gcwks.cpp/gcsvr.cpp and gceewks.cpp/gceesvr.cpp) and instead compile the shared GC implementation sources directly into separate WKS/SVR object sets, using a new shared compilation-context header gcinternal.h.
Changes:
- Introduces
gcinternal.hand updates many GC.cppfiles to include it and wrap code inWKS/SVRnamespaces based onSERVER_GC. - Updates CoreCLR VM and standalone GC CMake build graphs to build GC sources as object libraries for WKS/SVR and consume them from coreclr/clrgc targets.
- Updates NativeAOT runtime and the GC sample project build surfaces to consume the new GC source layout.
Reviewed changes
Copilot reviewed 31 out of 33 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/coreclr/vm/wks/CMakeLists.txt | Adds GC object files into cee_wks_core build inputs. |
| src/coreclr/vm/CMakeLists.txt | Defines shared GC source list and builds vm_gc_wks/vm_gc_svr object libraries. |
| src/coreclr/dlls/mscoree/coreclr/CMakeLists.txt | Links vm_gc_wks/vm_gc_svr into coreclr and coreclr_static. |
| src/coreclr/nativeaot/Runtime/CMakeLists.txt | Replaces wrapper sources with direct GC .cpp compilation for NativeAOT. |
| src/coreclr/gc/gcinternal.h | New shared GC compilation-context header; centralizes includes and inlines. |
| src/coreclr/gc/*.cpp | Switches individual GC implementation files to include gcinternal.h and wrap in WKS/SVR namespaces. |
| src/coreclr/gc/CMakeLists.txt | Builds standalone GC (clrgc/clrgcexp) with WKS/SVR object libraries. |
| src/coreclr/gc/sample/* | Updates GC sample to compile split GC .cpp files directly. |
|
what is the motivation for this change? Does it improve build times? |
It doesn't affect build time in any way. The main reason is code editing experience. The individual files into which the gc.cpp was split in my recent change didn't include the headers for symbols they use, so when editing one of those e.g. in VS code, it was showing a lot of red squiggles and code navigation didn't work well. |
|
Build breaks.... |
|
2% size savings on Hello World on Linux, nice! Size statisticsPull request #126720
|
Do we understand why this makes the code smaller? It is likely making it both smaller and slower (less inlining) ... not something we necessarily want for the GC. It may be a good idea to measure the impact on GC throughput, on both Windows and Linux. This type of refactoring tends to depend on good PGO data and whole program optimizations for good perf:
|
I definitely want to run the GC perf runs and understand where the size improvement comes from before we merge this. |
|
@jkotas performance and functionality tests didn't show any regressions. |
|
Have you measured it on a binary that showed the large code size reduction? (Also, there is a merge conflict that needs to be resolved.) |
|
They run aspnetbenchmarks (https://github.com/dotnet/performance/blob/main/src/benchmarks/gc/GC.Infrastructure/Configurations/ASPNetBenchmarks/ASPNetBenchmarks.csv), GCPerfSim benchmarks (https://github.com/dotnet/performance/tree/main/src/benchmarks/gc/GC.Infrastructure/Configurations/GCPerfSim) and a bunch of hand-picked microbenchmarks (https://github.com/dotnet/performance/blob/main/src/benchmarks/gc/GC.Infrastructure/Configurations/Microbenchmark/MicrobenchmarksToRun.txt) from the dotnet/performance repo on Windows. |
|
I can see that on Windows, the clrgcexp.dll is about 5kB smaller and coreclr.dll about 3.5kB smaller. |
|
I have just noticed that the list of sizes that @MichalStrehovsky has shared contains much larger size changes on linux for the same apps, I am going to do some Linux testing and also compare the disassembly of some of the binaries from Michal's list. |
|
Linux shows substantial changes in inlining and also loop unrolling, I need to run the perf tests on Linux too to see the impact. |
Move the GC sources away from the wrapper-file model that text-included gc.cpp and gcee.cpp under SERVER_GC and instead compile the shared sources directly as separate WKS and SVR objects. This change introduces gcinternal.h as the shared compilation context for the gc.cpp split, converts the former tail-included GC implementation fragments into separately compiled translation units, and updates the GC, VM, NativeAOT, and GC sample build surfaces to consume the new object layout. It also removes the gcsvr.cpp/gcwks.cpp and gceesvr.cpp/gceewks.cpp wrappers, compiles gcee.cpp through the same dual-build WKS/SVR source lists as gc.cpp, deduplicates the repeated WKS/SVR source lists in the relevant CMake files, and renames the shared GC header from gc_common.h to gcinternal.h to avoid confusion with gccommon.cpp. During the split, cross-translation-unit declarations and inline helpers needed by multiple GC source files were moved into the shared header, while local-only inline helpers were moved back into their owning .cpp files to avoid keeping unnecessary bodies in the shared header. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
86fcd07 to
b5ca3ca
Compare
|
I was finally able to verify that it doesn't degrade performance on Linux (I needed to make the GC performance testing framework to work on Linux first). |
| target_link_libraries(clrgcexp PRIVATE gcexp_dll_svr_descriptor) | ||
| endif() | ||
| target_compile_definitions(clrgcexp PRIVATE -DUSE_REGIONS) | ||
| set_property(TARGET clrgcexp PROPERTY INTERPROCEDURAL_OPTIMIZATION TRUE) |
There was a problem hiding this comment.
Why do we set INTERPROCEDURAL_OPTIMIZATION for some targets and not others?
There was a problem hiding this comment.
You are right, it should be added to the clrgc too, I've overlooked it.
There was a problem hiding this comment.
Hmm, I also wonder why I haven't used add_pgo instead here. I'll double check that.
There was a problem hiding this comment.
So, add_pgo would actually require us to have pgo data for the libcoreclr.so and libcoreclrext.so. We currently generate pgo for libcoreclr.so and libclrjit.so only.
When PGO data is missing, we just fall back to not enabling the LTO at all. The PGO and LTO are tight together, PGO is essentially part of LTO.
There seems to also be an issue with object libraries and PGO. add_pgo only adds the LTO options to the final targets, but on Unix, it needs to be specified at the compilation time too. So any code that we compile as object library and then add to e.g. libcoreclr.so would not have PGO enabled. It seems our PGO optimization is broken as many parts of the coreclr are now object libraries.
There was a problem hiding this comment.
Maybe we should enable LTO by default for everything (there is a cmake global property to do that) and change the add_pgo to only add the option to specify the profile file.
There was a problem hiding this comment.
enable LTO by default for everything
We cannot enable LTO for the object files and static libraries that we ship. LTO is tied to specific compiler version that is not guaranteed to be on user's machine.
BTW: We do not collect the PGO data for most of libcoreclr.so currently - #128412 is working towards fixing that.
There was a problem hiding this comment.
Actually, we may still be able to compile with -flto and get functional shipping static libs, using fat LTO. The benefit would be that for object files that go both to shipping static libs and to our shared libs, we could take advantage of LTO for the shared libs case. See https://llvm.org/docs/FatLTO.html. Supported in LLVM 17 and newer. But that's for a separate experiment.
Have you verified impact on NativeAOT too? |
I don't know how to do that. The tool we have for testing and comparing GC perf is the GC perf infrastructure tool in dotnet/performance which uses GCPerfSim, selected microbenchmarks and two selected ASPNet benchmarks. It doesn't support NativeAOT. |
Run a program that spends a lot of time in the GC. Here is an example https://gist.github.com/jkotas/dca7f72bcac821af48f387dfebbc63ba . The GC measurements are inherently noisy, so wait for at least 20 iterations until the number starts to stabilize. On current main, linux-x64 native aot binary size is 1242304 and I see the throughput stabilizing at ~4200ms. With this change (with the exact same main merged), native aot binary size is 1225920 and the throughput stabilizing at ~4300ms. So there is 16kB less code and 2.4% throughput loss. This program spends about 20% in the GC, so the GC itself is ~12% slower. Can you try to reproduce these results? |
|
To publish a project with private build of native aot toolchain:
|
Documented here https://github.com/dotnet/runtime/blob/main/docs/workflow/building/coreclr/nativeaot.md#building-packages. |
These steps are broken in multiple ways - tracked by #119166 . |
|
@jkotas the build This happens for me when building just main without any changes as well as the current PR. |
Do you have cpio prereq? https://github.com/dotnet/runtime/blob/main/docs/workflow/requirements/linux-requirements.md#debian-and-ubuntu |
Ah, I didn't have it on that machine, that's probably the culprit |
|
Hmm, I've ran it under a debugger and it was running with coreclr, so something went wrong with my publish. Trying again. |
|
So on my machine, the diff when really running the NativeAOT version was 0,9%. |
|
It would be best to figure out how to enable lto for the code in static libraries that we ship by applying the lto when we produce the static library. Alternatively, use the low-tech equivalent (e.g. https://cmake.org/cmake/help/latest/prop_tgt/UNITY_BUILD.html) that merges all .cpp files together during the build so that the C++ compiler can still see all code together. Otherwise, we will likely end up with thousands paper cut perf regression over time. |
I don't think that's possible. Static libraries are just bags of object files. Since they are not combined in any way, it seems LTO would not be able to do anything. It seems that the unity build would be the only option for this case. |
Technically, it is possible. By using The CMake Unity Build approach sounds interesting. |
You can combine the object files in a custom way to produce the library. We have started doing a few days ago to deal with libunwind symbol conflicts. The same technique may be used for LTO too #128927 (review) . This technique is compiler-toolchain specific. I do not think there is a publicly documented way to do this with MSVC. Unity build is more portable. |
Move the GC sources away from the wrapper-file model that text-included gc.cpp and gcee.cpp under
SERVER_GCand instead compile the shared sources directly as separate WKS and SVR objects.This change introduces gcinternal.h as the shared compilation context for the gc.cpp split, converts the former tail-included GC implementation fragments into separately compiled translation units, and updates the GC, VM, NativeAOT, and GC sample build surfaces to consume the new object layout.
It also removes the gcsvr.cpp/gcwks.cpp and gceesvr.cpp/gceewks.cpp wrappers, compiles gcee.cpp through the same dual-build WKS/SVR source lists as gc.cpp, deduplicates the repeated WKS/SVR source lists in the relevant CMake files, and renames the shared GC header from gc_common.h to gcinternal.h to avoid confusion with gccommon.cpp.
During the split, cross-translation-unit declarations and inline helpers needed by multiple GC source files were moved into the shared header, while local-only inline helpers were moved back into their owning .cpp files to avoid keeping unnecessary bodies in the shared header.
I've made size comparison between the new clrgc.dll, clrgcexp.dll and coreclr.dll and the changes in the gc dlls were very minor, around ~1.5kB growth due to little different decisions of the linker / compiler w.r.t. cold / hot code. The coreclr even became ~1.5kB smaller.